ABSTRACT

This paper describes the solution method used by the LeBuSiShu
team for track 1 of the ACM KDD CUP 2011 contest (resulting in
5th place). We identified two main challenges: the unique item
taxonomy characteristics and the large data set size.

To handle the item taxonomy, we present a novel method
called Matrix Factorization Item Taxonomy Regularization
(MFITR). MFITR obtained the 2nd best prediction result
out of more than ten implemented algorithms.

For rapidly computing multiple solutions of various algo- 
rithms, we have implemented an open source parallel col- 
laborative filtering library on top of the GraphLab machine 
learning framework. We report some preliminary perfor- 
mance results obtained using the BlackLight supercomputer. 

General Terms 

Machine learning, data mining 

Keywords 

Collaborative filtering, matrix factorization, tensor factor- 
ization. 

1. INTRODUCTION 

The task in track 1 of the ACM KDD CUP was to predict
music ratings using a real dataset obtained from the Yahoo!
music service. A full description of the dataset is given in [3].
Two main factors make the prediction task challenging.
Firstly, the dataset is rather large: there are 1,000,990 users,
624,961 music items (songs) and 262,810,175 user ratings,
spanning 6649 time bins. For data of this magnitude, commonly
used mathematical software like Matlab cannot be deployed
efficiently. Secondly, the data includes additional features,
such as the time when the user ratings were recorded and the
hierarchy linking rated items to genres (each rated song can
belong to one or more genres), albums and artists.

In this paper we describe how we handled the two challenges
described above. Section 2 outlines the theoretical algorithms
used for computing the prediction. Section 3 explains how
those algorithms were adapted to the KDD CUP contest, namely
by accounting for the hierarchy of data items. Section 4
discusses our efficient custom parallel implementation on top
of the GraphLab machine learning framework, which was used to
rapidly fine-tune multiple algorithm parameters, and reports
performance results. We conclude in Section 5.

As an additional contribution, we release open source code 
of many of the implemented algorithms as part of GraphLab's 
collaborative filtering library - available from 
http://graphlab.org/

2. ALGORITHMS 

Inspired by the Bellkor team's algorithm, which won the
Netflix contest [6], we deployed an ensemble method, combining
a collection of collaborative filtering algorithms and blending
their solutions together. The ensemble comprises the 12 methods
listed in Table 1, of which the last two are novel. In the rest
of this section we describe the implemented algorithms in more
detail.

Table 1: Different algorithms implemented. The last
two are our novel contribution.



2.1 Neighborhood models 

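An item-based neighborhood approach predicts the rating
$r_{ui}$ of a user $u$ for a new item $i$ using the ratings that
$u$ gave to items similar to $i$. We choose the Adjusted Cosine
(AC) similarity to measure the similarity $w_{ij}$ between items
$i$ and $j$:

$$ w_{ij} = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_u)(r_{uj} - \bar{r}_u)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_u)^2 \, \sum_{u \in U_{ij}} (r_{uj} - \bar{r}_u)^2}} . $$

Here, $U_{ij}$ denotes the set of users who have rated both items
$i$ and $j$, and $\bar{r}_u$ is the mean rating of user $u$. Based
on the similarity, for every item $i$ we compute the neighborhood
$N_i$, which contains the $K$ items most similar to $i$. We then
predict $r_{ui}$ based on the items in both $N_i$ and $R_u$, the
set of ratings made by user $u$:

$$ \hat{r}_{ui} = \frac{\sum_{j \in R_u \cap N_i} w_{ij} \, r_{uj}}{\sum_{j \in R_u \cap N_i} w_{ij}} . $$

Here, $\cap$ denotes the intersection of two sets.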

To address the computational challenges arising from the
huge number of items, we split the items into $N$ parts. In
the $i$th iteration, we only need to compute the neighbors
of the items in the $i$th part. In our experiments we set
$N = 300$, so that the matrix $M_{J \times J}$ fit into the memory
of an 8GB computer, where $J = I/N$ and $I$ is the number of
items. This method can be easily parallelized.
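
The neighborhood computation described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the contest: the dictionary-based data layout, the function names and the choice to keep only positively correlated neighbors are assumptions made for the example.

```python
from math import sqrt

def adjusted_cosine(ratings, i, j):
    """Adjusted cosine similarity between items i and j.

    ratings maps user -> {item: rating}. Only users who rated both items
    (the set U_ij) contribute, and ratings are centered on the user's mean.
    """
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if i in user_ratings and j in user_ratings:
            mean_u = sum(user_ratings.values()) / len(user_ratings)
            di = user_ratings[i] - mean_u
            dj = user_ratings[j] - mean_u
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    if den_i == 0.0 or den_j == 0.0:
        return 0.0
    return num / sqrt(den_i * den_j)

def top_k_neighbors(ratings, items, i, k):
    """Return {j: w_ij} for the k items most similar to i.

    Assumption of this sketch: non-positive similarities are dropped.
    """
    sims = [(adjusted_cosine(ratings, i, j), j) for j in items if j != i]
    sims.sort(reverse=True)
    return {j: w for w, j in sims[:k] if w > 0.0}

def item_parts(items, n_parts):
    """Split the item list into at most n_parts chunks (ceiling-sized)."""
    size = max(1, -(-len(items) // n_parts))
    return [items[s:s + size] for s in range(0, len(items), size)]

def build_neighborhoods(ratings, items, k, n_parts):
    """Process one part of the items at a time; parts are independent."""
    neighbors = {}
    for part in item_parts(items, n_parts):
        for i in part:
            neighbors[i] = top_k_neighbors(ratings, items, i, k)
    return neighbors

def predict(ratings, neighbors, u, i):
    """Similarity-weighted average over the items in both N_i and R_u."""
    num = den = 0.0
    for j, w in neighbors[i].items():
        if j in ratings[u]:
            num += w * ratings[u][j]
            den += w
    return num / den if den > 0.0 else None
```

Because each part is processed independently of the others, the loop over parts in `build_neighborhoods` is the natural place to distribute work across machines or cores.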

2.2 ALS and BPTF 

Alternating least squares (ALS) [16] is a simple matrix
factorization algorithm. The non-zero ratings form a matrix
$A$ of size $M \times N$, where $M$ is the number of users and
$N$ is the number of items. The matrix $A$ is decomposed into
two low-rank matrices, $A \approx UV$, where $U$ is of size
$M \times D$ and $V$ is of size $D \times N$. Starting from an
initial guess, each iteration first fixes $U$ and computes $V$
using a least squares procedure, then fixes $V$ and computes
$U$ using the same least squares procedure. The rating is
computed as a vector product of the matching user and item
feature vectors:



$$ \hat{r}_{ui} = \sum_{d=1}^{D} U_{ud} V_{di} . $$



The ALS model can be extended to the tensor case, where time
information is included with the rating. Bayesian probabilistic
tensor factorization (BPTF) [15] is a Markov Chain Monte Carlo
method in which, on top of the least squares step, sampling
from the hyperpriors of $U$ and $V$ is added.
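
The alternating procedure for the matrix case can be sketched as follows. This is a hedged illustration using NumPy, not the GraphLab implementation: the boolean `mask` marking observed ratings, the small ridge term `lam` (added here to keep each least squares system well conditioned) and all function names are assumptions of the example.

```python
import numpy as np

def als(A, mask, D=2, lam=0.1, iters=20, seed=0):
    """Factor A (M x N) as U (M x D) times V (D x N) using observed entries only.

    mask is a boolean array of the same shape as A; True marks an observed rating.
    lam is a small ridge term, an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    M, N = A.shape
    U = rng.normal(scale=0.1, size=(M, D))
    V = rng.normal(scale=0.1, size=(D, N))
    reg = lam * np.eye(D)
    for _ in range(iters):
        # Fix U, solve a small least squares problem for every item column of V.
        for i in range(N):
            rows = mask[:, i]
            if rows.any():
                Uo = U[rows]
                V[:, i] = np.linalg.solve(Uo.T @ Uo + reg, Uo.T @ A[rows, i])
        # Fix V, solve a small least squares problem for every user row of U.
        for u in range(M):
            cols = mask[u]
            if cols.any():
                Vo = V[:, cols]
                U[u] = np.linalg.solve(Vo @ Vo.T + reg, Vo @ A[u, cols])
    return U, V

def rmse(A, mask, U, V):
    """Root mean square error over the observed entries."""
    err = (A - U @ V)[mask]
    return float(np.sqrt(np.mean(err ** 2)))
```

Each user row and each item column is solved independently within a half-iteration, which is what makes ALS straightforward to parallelize.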

2.3 SGD 

Matrix factorization methods have demonstrated superior
performance vs. neighborhood-based models [§]. Matrix
factorization models map both users and items to a joint
latent factor space of dimension $D$, such that user-item
interactions are modeled as inner products in that space. Each
item $i$ and user $u$ is associated with a $D$-dimensional latent
feature vector, $q_i$ and $p_u$ respectively. The predicted rating
is thus computed by:



$$ \hat{r}_{ui} = \mu + b_i + b_u + q_i^T p_u . $$



The parameters $b_i$, $b_u$, $q_i$ and $p_u$ are learned by
minimizing a loss function over the $(u,i)$ pairs in the set of
observed ratings $O$:

$$ \min \sum_{(u,i) \in O} (r_{ui} - \hat{r}_{ui})^2 + \lambda \left( b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2 \right) \qquad (1) $$

where $\|\cdot\|$ denotes the Frobenius 2-norm and the positive
constant $\lambda$ controls the extent of regularization and is
determined by cross validation. We used stochastic gradient
descent optimization to minimize the loss function (1).
The complexity of each iteration is linear in the number of
ratings.
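
A minimal pure-Python sketch of stochastic gradient descent on the loss (1) is given below. The learning rate, the initialization and the function names are assumptions of the example (the constant factor 2 of the gradient is absorbed into the learning rate), and ratings on the 0-100 scale would need a correspondingly smaller step size.

```python
import random

def sgd_mf(ratings, n_users, n_items, D=3, lr=0.01, lam=0.05, epochs=50, seed=0):
    """Biased matrix factorization trained by SGD.

    ratings is a list of (user, item, rating) triples. Returns the model as
    (mu, bu, bi, p, q). All hyperparameter defaults are illustrative.
    """
    rnd = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = [0.0] * n_users
    bi = [0.0] * n_items
    p = [[rnd.gauss(0, 0.1) for _ in range(D)] for _ in range(n_users)]
    q = [[rnd.gauss(0, 0.1) for _ in range(D)] for _ in range(n_items)]
    order = list(ratings)
    for _ in range(epochs):
        rnd.shuffle(order)
        for u, i, r in order:  # one pass touches every rating once
            e = r - (mu + bu[u] + bi[i] + sum(a * b for a, b in zip(q[i], p[u])))
            bu[u] += lr * (e - lam * bu[u])
            bi[i] += lr * (e - lam * bi[i])
            for d in range(D):
                pu, qi = p[u][d], q[i][d]
                p[u][d] += lr * (e * qi - lam * pu)
                q[i][d] += lr * (e * pu - lam * qi)
    return mu, bu, bi, p, q

def predict(model, u, i):
    mu, bu, bi, p, q = model
    return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(q[i], p[u]))

def rmse(model, ratings):
    se = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5
```

The inner loop performs a constant amount of work per rating (for fixed $D$), which matches the linear per-iteration complexity stated above.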

2.4 SVD++ 

Implicit feedback can improve the prediction accuracy
since it provides an additional indication of user preferences.
SVD++ [5] is an extension of the linear model of (1). For
each item $i$, we add an additional latent factor $y_i$. Thus, the
latent factor vector of each user $u$ is also characterized by
the set of items $R_u$ the user has rated. The exact model is as
follows:

$$ \hat{r}_{ui} = \mu + b_i + b_u + q_i^T \Big( p_u + |R_u|^{-1/2} \sum_{j \in R_u} y_j \Big) . $$

Similarly, we can learn the parameters $b_i$, $b_u$, $q_i$, $p_u$
and $y_i$ using stochastic gradient descent optimization to
minimize the quadratic loss function. Again, the complexity per
iteration is linear in the number of ratings.
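
As an illustration of the SVD++ prediction rule, the following sketch evaluates the model for given parameters. The argument names and the dictionary holding the implicit factors $y_j$ are assumptions of the example, and training is not shown.

```python
from math import sqrt

def svdpp_predict(mu, b_i, b_u, q_i, p_u, y, rated_items):
    """SVD++ prediction for one (user, item) pair.

    y maps item -> implicit factor vector y_j; rated_items is the set R_u of
    items the user has rated. With an empty R_u the implicit term vanishes.
    """
    user_vec = list(p_u)
    if rated_items:
        norm = 1.0 / sqrt(len(rated_items))  # |R_u|^(-1/2)
        for j in rated_items:
            for d in range(len(user_vec)):
                user_vec[d] += norm * y[j][d]
    return mu + b_i + b_u + sum(a * b for a, b in zip(q_i, user_vec))
```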

2.5 Time-aware neighborhood models 

In the time-aware CF model, each rating $r_{ui}$ is associated
with a time stamp $t_{ui}$, which indicates the time when the
rating was observed. A rating $r_{ui}$ observed 3 years ago is
less important than a rating $r_{uj}$ taken 3 days ago when
used to predict current ratings.

Following [7], we define a time-decay function to model
this effect:

$$ f_{ui}(t) = e^{-\beta (t - t_{ui})} , $$

where $\beta > 0$ controls the decay rate. Setting $\beta = 0$
disables the temporal effect.

We can incorporate the temporal effect into the neighborhood
models as follows:

$$ \hat{r}_{ui}(t) = \frac{\sum_{j \in R_u \cap N_i} f_{uj}(t) \, w_{ij} \, r_{uj}}{\sum_{j \in R_u \cap N_i} f_{uj}(t) \, w_{ij}} . $$

In our experiments, we found that setting $\beta = 0.08$ gave
the best performance. Overall, the time-aware neighborhood
model achieved a significantly better result than the
time-independent neighborhood model.
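
The effect of the decay can be illustrated with a small sketch. The data layout and function names are assumptions of the example, and the time unit is left abstract: $t$ and the rating time stamps only need to be expressed in the same unit that $\beta$ was tuned for.

```python
from math import exp

def time_decay(t, t_rated, beta):
    """Exponential decay weight of a rating observed at time t_rated."""
    return exp(-beta * (t - t_rated))

def predict_time_aware(user_ratings, neighbors_i, t, beta=0.08):
    """Time-weighted neighborhood prediction for one user and one item.

    user_ratings maps item -> (rating, time stamp) for the user;
    neighbors_i maps neighbor item j -> similarity w_ij.
    """
    num = den = 0.0
    for j, w in neighbors_i.items():
        if j in user_ratings:
            r, t_j = user_ratings[j]
            f = time_decay(t, t_j, beta)
            num += f * w * r
            den += f * w
    return num / den if den > 0.0 else None
```

With `beta=0` every weight equals 1 and the prediction reduces to the time-independent neighborhood model; a positive `beta` shifts the prediction towards the more recent ratings.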

2.6 Time-aware matrix factorization 

As in [6], we incorporate temporal effects into the matrix
factorization by allowing the parameters to vary over time.
In particular, we quantize the user-based time effect
by days $T_1, T_2, \ldots, T_N$ and add an item-time-bin bias to
the equation. The predicted rating is computed as follows:

$$ \hat{r}_{ui}(t) = \mu + b_i + b_u + b_{u,t} + b_{i,Bin(t)} + q_i^T \Big( p_u + |R_u|^{-1/2} \sum_{j \in R_u} y_j \Big) . $$

Here, $b = Bin(t)$, $b = 0, 1, \ldots, N$, and we choose $N = 30$ as
the number of time bins. $b_{u,t}$ is a 2-dimensional array, which
we factorize as $b_{u,t} = x_u^T z_t$ to cut memory cost:

$$ \hat{r}_{ui}(t) = \mu + b_i + b_u + x_u^T z_t + b_{i,Bin(t)} + q_i^T \Big( p_u + |R_u|^{-1/2} \sum_{j \in R_u} y_j \Big) . $$
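
To give a feeling for the memory saving of the factorized user-time bias, the following sketch compares the number of stored values. It assumes, purely for illustration, that the time index ranges over the 6649 time bins mentioned in Section 1 and that $x_u$ and $z_t$ have a hypothetical dimension `K = 10`; neither value is stated for this model in the text.

```python
def user_time_bias(x_u, z_t):
    """Factorized bias b_{u,t} = x_u^T z_t."""
    return sum(a * b for a, b in zip(x_u, z_t))

def dense_entries(n_users, n_times):
    """Values stored when b_{u,t} is kept as a full 2-dimensional array."""
    return n_users * n_times

def factorized_entries(n_users, n_times, k):
    """Values stored when only the vectors x_u and z_t are kept."""
    return (n_users + n_times) * k

N_USERS = 1_000_990   # number of users in the dataset
N_TIMES = 6649        # assumed number of distinct time indices
K = 10                # hypothetical dimension of x_u and z_t

dense = dense_entries(N_USERS, N_TIMES)
factored = factorized_entries(N_USERS, N_TIMES, K)
```

Under these assumptions the dense array would hold several billion values, while the factorized form holds roughly ten million.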



Hence, we can extend the time-independent loss function
to the following form:

$$ \min \sum_{(u,i,t) \in O} (r_{uit} - \hat{r}_{ui}(t))^2 + \lambda_1 \left( b_i^2 + b_u^2 + b_{i,Bin(t)}^2 \right) + \lambda_2 \Big( \|q_i\|^2 + \|p_u\|^2 + \sum_{j \in R_u} \|y_j\|^2 \Big) + \lambda_3 \left( \|x_u\|^2 + \|z_t\|^2 \right) . $$

Similarly, the model described in Section 3 can be extended
to a time-aware version:

$$ \hat{r}_{ui}(t) = \mu + b_i + b_u + b_{ar} + x_u^T z_t + b_{i,Bin(t)} + (q_i + q_{ar})^T p_u . $$

2.7 Random Forests 

The above techniques largely make use of rating information
only. To utilize the remaining information, such as album and
artist information, we used random forests [2] to perform
regression on all item features. We treat the prediction task
as estimating a function $x_{ui} \to r_{ui}$. Here, $x_{ui}$ is a
$D$-dimensional vector holding the features of a $(u, i)$ pair
and $r_{ui}$ denotes the rating. In our experiments, the features
included the user id, item id, artist, album and genres. On its
own, this model did not perform as well as traditional
collaborative filtering algorithms (it obtained an RMSE of 26 on
the validation dataset), but we found that it improved our final
solution after blending with the other models (an improvement
of 0.08 on the leaderboard).

2.8 Blending multiple solutions 

A lesson learned from the Netflix contest is that the
combination of different algorithms can lead to significant
performance improvement over individual algorithms [14]. We
blended our multiple predictors using a linear regression
model. We use the validation set to compute the optimal
$M$-dimensional vector of linear combination weights $w$, where
$M$ is the number of blended models.

First, we train each model $m_j$ independently on the training
set and obtain its predictions $x_j$ for the $N$ ratings in the
validation set, where $x_j$ is an $N$-dimensional vector. Let $X$
be the $N \times M$ matrix whose columns are the $x_j$, and let $y$
hold the target values of the $N$ data points. The weights $w$
can then be obtained by solving a least squares problem. In our
solution, we used ridge regression:

$$ w = (X^T X + \lambda I)^{-1} X^T y , $$

where $I$ denotes the identity matrix and the regularization
parameter $\lambda$ is determined by cross validation [5].

Next, every model is trained again using the same
initialization parameters, but training is now performed using
both the training set and the validation set. Finally, the test
set predictions $\hat{x}_j$ of each model $m_j$ are computed and
the final prediction is obtained as

$$ \hat{r} = \hat{X} w , $$

where the columns of $\hat{X}$ are the test set predictions
$\hat{x}_j$.
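
The blending step can be sketched with NumPy as follows. The function names are assumptions of the example, and the regularization value would in practice be chosen by cross validation as described above.

```python
import numpy as np

def blend_weights(X_val, y_val, lam):
    """Ridge regression weights for blending.

    X_val has one column per model holding its validation predictions;
    y_val holds the true validation ratings.
    """
    M = X_val.shape[1]
    return np.linalg.solve(X_val.T @ X_val + lam * np.eye(M), X_val.T @ y_val)

def blend(X_test, w):
    """Final prediction: weighted combination of the models' test predictions."""
    return X_test @ w
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is numerically preferable and gives the same weights.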

3. MFITR: MATRIX FACTORIZATION WITH 
ITEM TAXONOMY REGULARIZATION 

A unique property of the KDD data is that tracks, albums,
artists and genres form a hierarchy, where each track
belongs to an album, each album belongs to an artist, and
both are tagged by genres [3]. We propose MFITR, a novel
method that uses item taxonomy information to improve
prediction accuracy.



To capture the hierarchy of items, we construct a graph
between tracks, albums and artists. (We did not use genre
information.) The method differs from traditional matrix
factorization in that we model the item hierarchy as a
regularization term that constrains the matrix factorization
computation, as explained next. A closely related approach
was recently proposed in the social network domain [10].

We say that an object $i$ is a parent of another object $j$ if
there exists a hierarchical relationship between them and $j$
belongs to $i$; conversely, $j$ is a child of $i$. The root nodes
of the hierarchy are artists, whose children are albums. Tracks
belonging to an album are children of that album. The item
hierarchy is therefore a graph, and we denote the parent set of
item $i$ as $P_i$ and its child set as $C_i$.

If a user $u$ has given an artist $a_i$ a rating of 100 and
another artist $a_j$ a rating of 0, we can intuitively expect
that $u$ will like the tracks and albums of $a_i$ more than those
of $a_j$. Based on this intuition, we propose a model based on
matrix factorization. The prediction is computed by:

$$ \hat{r}_{ui} = \mu + b_i + b_u + b_a + (q_i + q_a)^T p_u . $$

We use $b_a$ as the bias of artist $a$, the performer of music
item $i$, and $q_a^T p_u$ to capture the preference of user $u$
for artist $a$. Furthermore, to support different ratings of
tracks in the same album, we propose a more advanced model:

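$$ \min \sum_{(u,i) \in O} \left( r_{ui} - \mu - b_i - b_u - b_a - (q_i + q_a)^T p_u \right)^2 + \lambda_1 \left( b_i^2 + b_u^2 + b_a^2 \right) + \lambda_2 \left( \|q_i\|^2 + \|p_u\|^2 + \|q_a\|^2 \right) + \lambda_3 \sum_i \sum_{j \in P_i} w_{ij} \|q_i - q_j\|^2 + \lambda_4 \sum_i \sum_{j \in C_i} w_{ij} \|q_i - q_j\|^2 . $$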

Here, $w_{ij}$ is the similarity between $i$ and $j$, computed as
in the neighborhood model. If $i$ and $j$ are similar, the distance
between $q_i$ and $q_j$ should not be large. Table 2 summarizes
the notation used in MFITR.

Table 2: MFITR Notations



The MFITR cost function is composed of four terms. The
first term minimizes the Euclidean distance between the
observed and predicted ratings, where the biases of the user,
item and artist are taken into account. The second term is a
regularization term on the biases to prevent over-fitting. The
third and fourth terms enforce similarity between tracks in
the same album and between albums of the same artist,
when they are rated closely by the neighborhood model.

An advantage of this approach is that the similarity between
parent and children can propagate indirectly in the
learning phase [10]. For example, two tracks $t_a$ and $t_b$ from
the same album $l$ have no direct relationship between them.
However, since they have the same parent $l$, the distance
between $q_{t_a}$ and $q_{t_b}$ is minimized indirectly when the
distances $w_{l,t_a} \|q_l - q_{t_a}\|^2$ and
$w_{l,t_b} \|q_l - q_{t_b}\|^2$ are minimized.

An immediate extension we implemented is to add time
information into the cost function, as done in time-SVD++.
We call this variant time-MFITR. As shown in the next
section, time-MFITR performs very well on the KDD data.
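
The taxonomy regularization and its indirect propagation effect can be illustrated with a small sketch. It covers only the parent and child penalty terms, not the full MFITR objective; the weights `lam_parent` and `lam_child`, the data layout and the plain gradient step are assumptions of the example.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def taxonomy_terms(q, parents, children, w, lam_parent, lam_child):
    """Yield (coefficient, i, j) for every parent and child term of the penalty."""
    for i in q:
        for j in parents.get(i, ()):
            yield lam_parent * w[(i, j)], i, j
        for j in children.get(i, ()):
            yield lam_child * w[(i, j)], i, j

def taxonomy_penalty(q, parents, children, w, lam_parent=1.0, lam_child=1.0):
    """Sum of similarity-weighted squared distances along hierarchy edges."""
    return sum(c * sqdist(q[i], q[j])
               for c, i, j in taxonomy_terms(q, parents, children, w,
                                             lam_parent, lam_child))

def penalty_gradient_step(q, parents, children, w, lr,
                          lam_parent=1.0, lam_child=1.0):
    """One gradient descent step on the penalty alone; returns new vectors."""
    grad = {i: [0.0] * len(v) for i, v in q.items()}
    for c, i, j in taxonomy_terms(q, parents, children, w, lam_parent, lam_child):
        for d in range(len(q[i])):
            g = 2.0 * c * (q[i][d] - q[j][d])
            grad[i][d] += g   # pulls q_i towards q_j
            grad[j][d] -= g   # pulls q_j towards q_i
    return {i: [x - lr * g for x, g in zip(q[i], grad[i])] for i in q}

# Two tracks of the same album l: no direct edge connects ta and tb.
q0 = {"l": [0.0, 0.0], "ta": [1.0, 0.0], "tb": [0.0, 1.0]}
parents = {"ta": ["l"], "tb": ["l"]}
children = {"l": ["ta", "tb"]}
w = {("ta", "l"): 1.0, ("tb", "l"): 1.0, ("l", "ta"): 1.0, ("l", "tb"): 1.0}
q1 = penalty_gradient_step(q0, parents, children, w, lr=0.1)
```

In this toy setting a single step moves both track vectors towards the album vector, so the distance between `ta` and `tb` shrinks although no penalty term links them directly.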

4. EFFICIENT MULTICORE IMPLEMENTATION

A majority of the algorithms described were implemented
on top of the GraphLab parallel machine learning framework
[9]. We selected GraphLab since it allowed us to rapidly
prototype and test multiple CF algorithms. The following
algorithms were implemented: ALS, weighted-ALS (wALS),
SVD++, PMF, BPTF and SGD. Since we used multiple
algorithms, each with multiple tunable parameters that needed
to be adjusted, an efficient parallel solution was essential for
rapidly improving our model. All of the above algorithms are
open sourced as part of the GraphLab collaborative filtering
library:
http://graphlab.org/

We utilized our own cluster (several AMD Opteron 8387
4-8 core machines, 2.7GHz, 16-64GB memory) as well as the
BlackLight [1] supercomputer (an SGI UV 1000 NUMA
shared-memory system comprising 256 blades; each blade holds
2 Intel Xeon X7560 Nehalem 2.27GHz eight-core processors,
for a total of 4096 cores). Overall, we estimate that we used
around 10,000 CPU hours on our clusters and 10,000 CPU hours
on BlackLight. Each algorithm was run in parallel on 8-32
cores, using line search for each tunable parameter. Each of
those runs was repeated twice, with and without the validation
data used for training, as explained in Section 2.8.

4.1 Performance results 

Table 3 lists the different tunable parameters we tested
for each algorithm and the optimized settings we found. As
a baseline for performance, we measure RMSE (root mean
square error) on the validation dataset. The most effective
single algorithm is time-SVD++, which obtained an RMSE of
20.90 on the validation data. The second most effective
single algorithm is our novel time-MFITR algorithm, which
obtained an RMSE of 21.10 on the validation data. Note that
while wALS obtained the best performance on the validation
data, it overfit and gave worse performance on the actual test
data. A summary of the results is given in Figure 1.

Figure 1: RMSE of the different algorithms on the
validation data.

Regarding the performance of the parallel implementation,
Figure 2 shows the speedup of several algorithms using
BlackLight. Similar results were also obtained on the Linux
cluster and are not repeated here. Speedup is defined relative
to the baseline of a single CPU run. For wALS and ALS we
obtain an almost optimal speedup of x14 on 16 cores. BPTF
performs slightly worse, since it has a serial sampling step
after each iteration, which slows the algorithm a little. SGD
and SVD++ perform worse, with speedups of about x6 and x3,
respectively. The reason is that we deploy a locking mechanism
to prevent different users from updating the same item feature
vector concurrently.
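
The idea behind the locking mechanism can be illustrated with a small Python sketch that keeps one lock per item feature vector. This is not GraphLab's implementation; the class and function names are hypothetical, and Python threads are used here only to show the serialization of conflicting updates, not to obtain a speedup.

```python
import threading

class LockedItemFactors:
    """Item feature vectors protected by one lock per item."""

    def __init__(self, n_items, D):
        self.q = [[0.0] * D for _ in range(n_items)]
        self.locks = [threading.Lock() for _ in range(n_items)]

    def update(self, i, delta):
        # Updates to the same item are serialized; different items do not block.
        with self.locks[i]:
            qi = self.q[i]
            for d, g in enumerate(delta):
                qi[d] += g

def run_workers(store, item, n_threads, n_updates):
    """Let several threads (standing in for users) update the same item."""
    def work():
        for _ in range(n_updates):
            store.update(item, [1.0] * len(store.q[item]))
    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

When many ratings refer to the same popular item, the workers spend part of their time waiting for that item's lock, which is consistent with the lower speedups reported for SGD and SVD++.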

Regarding the accuracy of the parallel computation vs. an
equivalent serial run, Figure 3 examines the validation
RMSE after 5 iterations of the SGD algorithm (D=50) using
different numbers of cores (1-16) on BlackLight. Because of the
parallel implementation there are slight variations in
accuracy. However, the variations are no more than 0.1% of the
serial result. Similar behavior was observed for the other
algorithms.

Another interesting question we looked at is how
computation scales with the length of the feature vector.
Figure 4 shows the good scaling of the SGD algorithm. This
scaling was also observed for SVD++. The running time of both
algorithms is almost linear in the number of features. ALS,
wALS and BPTF all perform matrix inversion as part of the
update rule, and thus scale less well.

Figure 5 depicts the running time of a single iteration of
several algorithms on 16 cores with D = 20. SGD and SVD++
(not shown, but with a running time similar to SGD) are the
fastest algorithms per iteration. On the other hand,
it is more difficult to make them work efficiently in parallel.




Figure 2: Speedup of GraphLab using KDD data on
BlackLight with up to 16 cores (x-axis: number of cores).



Table 3: Main results on the validation data obtained using the different algorithms. 

Figure 3: Accuracy of SGD validation RMSE using
different number of cores. Variations are up to 0.1%.



5. CONCLUSION 

We have utilized the GraphLab parallel machine learning
framework to efficiently and rapidly implement multiple
collaborative filtering algorithms. Having a fast way of testing
multiple model settings allowed us to efficiently blend
multiple algorithms together. We have further introduced
a novel algorithm called MFITR that accounts for item
taxonomy. The fast multicore implementation of multiple
algorithms, combined with the solution of our MFITR
algorithm, allowed us to achieve 5th place in track 1 of
the ACM KDD CUP 2011 contest.

Figure 4: SGD iteration time vs. feature vector
width (D).